Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-91524: Speed up the regular expression substitution #91525

Merged
merged 11 commits into from
Oct 23, 2022

Conversation

serhiy-storchaka
Copy link
Member

@serhiy-storchaka serhiy-storchaka commented Apr 14, 2022

Functions re.sub() and re.subn() and corresponding re.Pattern methods
are now 2-3 times faster for replacement strings containing group references.

Closes #91524

Functions re.sub() and re.subn() and corresponding re.Pattern methods
are now 2-3 times faster for replacement strings containing group references.
Modules/_sre/sre.c Outdated Show resolved Hide resolved
Modules/_sre/sre.c Show resolved Hide resolved
Modules/_sre/sre.c Show resolved Hide resolved
Modules/_sre/sre.c Show resolved Hide resolved
@bedevere-bot
Copy link

When you're done making the requested changes, leave the comment: I have made the requested changes; please review again.

@ghost
Copy link

ghost commented Apr 15, 2022

If use _PyUnicodeWriter/_PyBytesWriter, will it be faster?

There are two data contracts that crossing functions:

  1. between _parser.parse_template() and _sre.template(). List must be [literal, (group, literal)*]
  2. between _sre.template() and expand_template(). chunks = 1 + sum(1+bool(item.literal!=NULL) for item in items)

Such data contracts increase the difficulty of code maintenance. If use _PyUnicodeWriter/_PyBytesWriter, these data contracts are not necessary. And change TemplateObject like this:

typedef struct {
    PyObject_VAR_HEAD
    struct {
        int type; // literal or group
        PyObject *item;
    } items[0];
} TemplateObject;

In the current PR, adjacent groups also waste space.

Lib/re/__init__.py Show resolved Hide resolved
Lib/re/_constants.py Outdated Show resolved Hide resolved
Modules/_sre/sre.c Outdated Show resolved Hide resolved
Modules/_sre/sre.c Outdated Show resolved Hide resolved
Modules/_sre/sre.c Outdated Show resolved Hide resolved
Lib/re/__init__.py Show resolved Hide resolved
Lib/re/_constants.py Outdated Show resolved Hide resolved
Lib/re/_parser.py Show resolved Hide resolved
Modules/_sre/sre.c Outdated Show resolved Hide resolved
Modules/_sre/sre.c Show resolved Hide resolved
Modules/_sre/sre.c Show resolved Hide resolved
Modules/_sre/sre.c Outdated Show resolved Hide resolved
Modules/_sre/sre.c Outdated Show resolved Hide resolved
Modules/_sre/sre.c Show resolved Hide resolved
@serhiy-storchaka
Copy link
Member Author

If use _PyUnicodeWriter/_PyBytesWriter, will it be faster?

I do not think so. The benefit of _PyUnicodeWriter/_PyBytesWriter is that we do not need to allocate an intermediate list and resize it multiple times if the number of items is not known. But in our case it is known, and in common case the buffer allocated on the stack is used. _PyUnicode_JoinArray should be faster because it knows ahead the length and the kind of the result string. _PyUnicodeWriter can perform several memory re-allocations.

I may try to experiment with _PyBytesWriter later. It will complicate the code, so I am not sure it is worth it.

@serhiy-storchaka
Copy link
Member Author

Thank you for review. The conditions of my sight make my to easily miss some details.

I have made the requested changes; please review again.

@bedevere-bot
Copy link

Thanks for making the requested changes!

@gpshead: please review the changes made to this pull request.

@bedevere-bot bedevere-bot requested a review from gpshead April 15, 2022 06:47
@serhiy-storchaka
Copy link
Member Author

Until there is a clear answer, I plan to suspend work with Serhiy Storchaka.

Please do not do this. Your help compensates my disadvantage.

@ghost
Copy link

ghost commented Apr 15, 2022

It doesn't matter, the worst result is to postpone it to a later version.

There are a few possible changes about re module before 3.11 Beta1, #91495, #32411, #91477. Maybe you have other issues to deal with.
I'm afraid if you try too hard, it may be dangerous to your health.

@gpshead
Copy link
Member

gpshead commented Apr 16, 2022

@animalize - While it is great you are concerned for others health, it comes across poorly to say things in an "ultimatum" style such as a "threat" to stop working with someone because they've implied they may have a health concern. This is not kind. In many cultures (including Python's) that conduct is seen as discrimination. It limits others opportunities based on your own presumption of their abilities.

Among committers (which @serhiy-storchaka is a long time member of) none of us expect the steering council to micromanage our lives and most of us would rightfully reject that concept if we were to try. That isn't our purpose. While asking me or the SC to treat another committer specially was presumably voiced out of genuine concern here, it can come across as a put down that belittles their work even if this is not what you intended. We trust everyone to understand their own contribution abilities and act accordingly as they see fit.

Code review collaboration is a way for all of us to fill in gaps in that we all have in our work. Nobody is perfect.

(I'm replying with the above in this forum so that this response is visible to the same audience as the earlier messages that led to it.)

Lets ultimately stay focused on the PR here. Thanks for your reviews!

@ghost
Copy link

ghost commented Apr 17, 2022

Sorry. I don't know what to say.

I tried to improve this PR locally, IMHO the data contracts and the buffer[10] optimization make the code harder to understand. (Now I'm working on #91616)

@ghost
Copy link

ghost commented Apr 19, 2022

I'll send a PR to serhiy-storchaka:re-compile-template later.

@eendebakpt
Copy link
Contributor

@serhiy-storchaka This PR has been approved by one other core dev, but not merged for some reason. The performance improvement seems worthwhile. Is there a reason why the branch has not been updated? If needed I can help to update the branch to current main

@gpshead gpshead merged commit 75a6fad into python:main Oct 23, 2022
@gpshead
Copy link
Member

gpshead commented Oct 23, 2022

Thanks Serhiy & reviewers!

@serhiy-storchaka serhiy-storchaka deleted the re-compile-template branch October 24, 2022 05:55
@serhiy-storchaka
Copy link
Member Author

Thank you Gregory for the final polishing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Performance or resource usage topic-regex
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Speed up regular expression substitution
6 participants